Embedded all relevant feature selection with Random Ferns
Abstract
Many machine learning methods can produce variable importance scores expressing the usefulness of each feature in the context of the produced model; these scores on their own are, however, not sufficient to perform feature selection, especially when an all relevant selection is required. Although there are wrapper methods aiming to solve this problem, they introduce a substantial increase in the required computational effort. In this paper I investigate the idea of incorporating all relevant selection within the training process by producing importance scores for implicitly generated shadows, attributes that are irrelevant by design. I propose and evaluate such a method in the context of the random ferns classifier. Experimental results confirm the effectiveness of this approach, although they also show that the fully stochastic nature of random ferns limits its applicability either to small dimensions or as a part of a broader feature selection procedure.
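As a rough illustration of the shadow idea referred to above, the sketch below appends permuted copies of the original features (irrelevant by design) to the training data and keeps only the features whose importance exceeds the best shadow score. This is a minimal, single-pass sketch, not the paper's procedure: scikit-learn's RandomForestClassifier stands in for the random ferns classifier, and the function name and the threshold rule are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_relevant_features(X, y, random_state=0):
    X = np.asarray(X)
    rng = np.random.default_rng(random_state)
    # Shadows: copies of the original features with each column permuted
    # independently, so they keep their marginal distributions but are
    # irrelevant to the target by design.
    shadows = np.apply_along_axis(rng.permutation, 0, X)
    X_aug = np.hstack([X, shadows])

    model = RandomForestClassifier(n_estimators=500, random_state=random_state)
    model.fit(X_aug, y)

    n = X.shape[1]
    real_imp = model.feature_importances_[:n]
    shadow_imp = model.feature_importances_[n:]

    # A feature is tentatively relevant if it outscores every shadow attribute.
    return np.flatnonzero(real_imp > shadow_imp.max())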
Similar resources
Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features
Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...
Identification of Statistically Significant Features from Random Forests
Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection takes into account the interactions between variables but is not easy to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests if variables are significantly useful in combination with others in a forest. We show e...
The All Relevant Feature Selection using Random Forest
In this paper we examine the application of the random forest classifier to the all relevant feature selection problem. To this end we first examine two recently proposed all relevant feature selection algorithms, both random forest wrappers, on a series of synthetic data sets of varying size. We show that reasonable accuracy of predictions can be achieved and that heuristic algorith...
A Parallel Genetic Algorithm Based Method for Feature Subset Selection in Intrusion Detection Systems
Intrusion detection systems are designed to provide security in computer networks, so that if an attacker gets past other security devices, the attack can still be detected and prevented. One of the most essential challenges in designing these systems is the so-called curse of dimensionality. Therefore, in order to obtain satisfactory performance in these systems we have to take advantage of app...
Feature Selection with the Boruta Package
This article describes the R package Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which a statistical test shows to be less relevant than random probes. The Boruta package provides a convenient interface to the algorith...
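To make the wrapper nature of such a procedure concrete, here is a hedged, Python-flavoured sketch of a Boruta-style loop; the actual Boruta package is written in R and its interface differs, so the function name, round count, and test setup below are illustrative assumptions. Each round regenerates the shadow attributes, records a "hit" for every feature that beats the best shadow, and a one-sided binomial test finally separates features with significantly many hits from the rest.

import numpy as np
from scipy.stats import binomtest
from sklearn.ensemble import RandomForestClassifier

def boruta_like_selection(X, y, n_rounds=20, alpha=0.05, random_state=0):
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    n = X.shape[1]
    hits = np.zeros(n, dtype=int)

    for r in range(n_rounds):
        # Fresh shadows each round: column-wise permuted copies of the features.
        shadows = np.apply_along_axis(rng.permutation, 0, X)
        model = RandomForestClassifier(n_estimators=300,
                                       random_state=random_state + r)
        model.fit(np.hstack([X, shadows]), y)
        imp = model.feature_importances_
        # Count a "hit" for every real feature beating the best shadow this round.
        hits += imp[:n] > imp[n:].max()

    # One-sided binomial test against chance (p = 0.5): features with
    # significantly more hits than expected by chance are deemed relevant.
    pvals = [binomtest(int(h), n_rounds, 0.5, alternative="greater").pvalue
             for h in hits]
    return np.flatnonzero(np.array(pvals) < alpha)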
Journal: CoRR
Volume: abs/1604.06133
Publication date: 2016